METHOD FOR DEFINING THE STRUCTURE OF A VIDEO SEQUENCE



FIELD OF THE INVENTION 

The invention relates to a method for defining in a hierarchical fashion the 
structure of a video sequence that corresponds to successive frames. The invention also 
relates to a method for indexing data that includes said definition method, to a device for 
carrying out said indexing method, and to an image retrieval system also implementing 
said method. This invention will be very useful in relation with the MPEG-7 standard. 

BACKGROUND OF THE INVENTION 

The future MPEG-7 standard is intended to specify a standard set of
descriptors that can be used to describe various types of multimedia information. The
description thus associated with a given content allows fast and efficient searching for
material of a user's interest. The invention relates more specifically to the case of 
representation of video sequences. 

A video indexing technique is described for instance in the document 
"Automatic video indexing via object motion analysis", J.D. Courtney, Pattern Recognition, 
volume 30, number 4, April 1997, pp.607-625. As explained in said document, the logical
organization of video sequences may be determined by means of a hierarchical 
segmentation, in the same manner a text is subdivided into chapters, paragraphs and 
sentences. 

SUMMARY OF THE INVENTION 

It is an object of the invention to propose a method that is able to 
automatically create the description of a video sequence, that is to say a table of contents 
of said sequence, on the basis of a new, specific criterion.

To this end, the invention relates to a method such as described in the 
introductory paragraph of the description and which moreover comprises : 

(1) a shot detection step, provided for detecting the boundaries between 
consecutive shots, a shot being a set of contiguous frames without editing effects ; 

(2) a sub-division step, provided for splitting each shot into sub-entities, 
called micro-segments ; 

(3) a clustering step, provided for creating the final hierarchical structure of 
the processed video sequence. 

Such a method makes it possible to split each shot of the processed video sequence
into sub-entities, here called micro-segments. Preferably, these micro-segments present,
according to the proposed criterion, a high level of homogeneity on the motion
parameters of the camera with which the original images have been captured (these
images then having been converted into a video bitstream constituting said processed
video sequence).

More precisely, the homogeneity of each micro-segment is computed on a
motion histogram, each bin of which shows the percentage of frames of the sequence
with a specific type of motion.

A micro-segment is perfectly homogeneous when it presents a single
combination of camera motion parameters along all its frames, the histogram bins then
being equal to 1 or 0. On the contrary, if the bins of the histogram are not equal to either
1 or 0, i.e. present intermediate values, the micro-segment is not perfectly
homogeneous. In order to segment a shot, a distance between two segments is
computed, based on the homogeneity of the union of the segments, said homogeneity being
itself deduced from the histogram of a micro-segment and the different motion types. The
homogeneity of a shot is equal to the homogeneity of its micro-segments weighted
by the length of each of them. A fusion between any pair of segments is decided or
not according to the value of the homogeneity of the shot with respect to a predefined
threshold T(H), assuming that the selected segments have already been merged, and
this merging process between micro-segments ends when there is no pair
of neighbouring micro-segments that can be merged.

It is another object of the invention to propose a video indexing device
including means for carrying out such a method and associated indexing means for
adding a label to each element of the hierarchical structure defined thanks to this
method.

It is still another object of the invention to propose an image retrieval
system including such a video indexing device and associated means for performing, on
the basis of the categorization issued from said indexing operation, any image retrieval
using one or several features of this image.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention will now be described, by way of example, with 
reference to the accompanying drawings in which : 

- Fig.1 shows a block diagram of the definition method according to the
invention ;

- Fig.2 illustrates an example of mDFD curve for a given sequence of frames ;

- Fig.3 shows an example of histogram illustrating the measure of the
segment homogeneity ;

- Fig.4 illustrates the process of initial oversegmented partition creation ;

- Fig.5 shows a binary tree such as created by implementation of a shot
merging sub-step provided in the definition method according to the invention ;

- Fig.6 shows the tree yielded after a tree restructuring sub-step ;

- Fig.7 illustrates a method for indexing data that have been processed
according to the invention ;

- Fig.8 illustrates an image retrieval system implementing said indexing 
method and allowing, thanks to appropriate associated means, to perform an image 
retrieval based on the categorization issued from such an indexing operation. 

DETAILED DESCRIPTION OF THE INVENTION 

The goal of a table of contents for a video sequence is to define the 
structure of this sequence in a hierarchical fashion, like in a text document. The original 
sequence is therefore subdivided into sub-sequences which can also be divided into 
shorter sub-sequences. At the end of this division process, the shortest entity to be 
described will be the micro-segment. 

More precisely, the method according to the proposed strategy is divided
into three steps, which are, as shown in Fig.1 : a shot detection step (in a sequence of
pictures, a video shot is a particular sequence which shows a single background), a
segmentation step of the detected shots, and a shot clustering step. These three steps
11, 12, 13 will now be described in a more detailed manner.

The first step 11 is provided for splitting the video sequence into shots
constituting the input data for the next steps. As a shot is a set of contiguous frames
without editing effects, this step must detect the transitions between consecutive
shots, which is done by means of two main sub-steps : a computation sub-step 111,
which determines a mean Displaced Frame Difference (mDFD) curve, and a
segmentation sub-step 112.

The mDFD curve computed during the sub-step 111 is obtained taking into
account both luminance and chrominance information. With, for a frame at time t, the
following definitions :

luminance $Y = \{f_k(i, j, t)\}_{k=Y}$ (1)

chrominance components $(U, V) = \{f_k(i, j, t)\}_{k=U,V}$ (2)

the DFD is given by :

$\mathrm{DFD}_k(i, j ; t-1, t+1) = f_k(i, j, t+1) - f_k(i - d_x(i, j), j - d_y(i, j), t-1)$ (3)

and the mDFD by :

$\mathrm{mDFD}(t) = \frac{1}{I_x I_y} \sum_{k}^{Y,U,V} w_k \sum_{i,j}^{I_x, I_y} |\mathrm{DFD}_k(i, j ; t-1, t+1)|$ (4)

where Ix, Iy are the image dimensions and w the weights for the Y, U, V components. An
example of the obtained curve, showing ten shots S1 to S10, is illustrated in Fig.2 with
weights that have been for instance set to {wY, wU, wV} = {1, 3, 3}. The transitions
between consecutive shots can be abrupt changes from one frame to the following one,
or more sophisticated, like dissolves, fades, and wipes : the highest peaks of the curve
correspond to the abrupt transitions (frames 21100, 21195, 21633, 21724), while, on the
other side, the oscillation from frame 21260 to frame 21279 corresponds to a dissolve
and the presence of large moving foreground objects in frames 21100-21195 and
21633-21724 creates high level oscillations of the mDFD curve.
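
Relations (3) and (4) can be illustrated by the following minimal sketch. It assumes that each frame is given as three planes (Y, U, V) stored as nested lists, that dx and dy hold one integer displacement per pixel, and that displaced positions falling outside the image are clamped to the border. The function name mdfd and the clamping rule are illustrative assumptions, not part of the described method.

```python
def mdfd(prev, nxt, dx, dy, weights=(1.0, 3.0, 3.0)):
    """Mean displaced frame difference between frames t-1 (prev) and t+1 (nxt).

    prev, nxt: three planes (Y, U, V), each a list of rows of pixel values.
    dx, dy:    per-pixel integer displacements, same layout as one plane.
    weights:   (wY, wU, wV); {1, 3, 3} is the example setting given in the text.
    """
    height, width = len(dx), len(dx[0])
    total = 0.0
    for k, w in enumerate(weights):
        for j in range(height):
            for i in range(width):
                # Displaced position in frame t-1, clamped to the image (assumption).
                si = min(max(i - dx[j][i], 0), width - 1)
                sj = min(max(j - dy[j][i], 0), height - 1)
                dfd = nxt[k][j][i] - prev[k][sj][si]   # relation (3)
                total += w * abs(dfd)
    return total / (width * height)                    # relation (4)
```

Evaluating this function for every time t of the sequence would give a curve of the kind shown in Fig.2.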

The sub-step 112, provided for detecting the video editing effects and
segmenting the mDFD curve into shots, uses a threshold-based segmentation to extract
the highest peaks of the mDFD curve (or another type of mono-dimensional curve). Such
a technique is described for instance in the document "Hierarchical scene change
detection in an MPEG-2 compressed video sequence", T. Shin et al., Proceedings of the
1998 IEEE International Symposium on Circuits and Systems, ISCAS'98, vol.4, March
1998, pp.253-256.
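
As a simplified illustration of a threshold-based extraction of the highest peaks, the sketch below marks every local maximum of a mono-dimensional curve that exceeds a fixed threshold and cuts the sequence at those positions. It is not the hierarchical technique of the cited document: the single fixed threshold, the local-maximum test and the function names are assumptions made for the example.

```python
def detect_shot_boundaries(curve, threshold):
    """Return the indices of local maxima of the curve that exceed the threshold."""
    boundaries = []
    for t in range(1, len(curve) - 1):
        is_peak = curve[t] >= curve[t - 1] and curve[t] > curve[t + 1]
        if is_peak and curve[t] > threshold:
            boundaries.append(t)
    return boundaries


def split_into_shots(n_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    edges = [0] + list(boundaries) + [n_frames]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]
```

A fixed threshold of this kind only captures abrupt transitions; gradual effects such as the dissolve mentioned above would need a more elaborate analysis of the curve.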

The second step 12 is a temporal segmentation provided for splitting each
detected shot into sub-entities called micro-segments. This segmentation step, applied to
each detected shot separately, consists of two sub-steps : an oversegmentation sub-step
121, intended to divide each shot into so-called micro-segments which must show a
perfect homogeneity, and a merging sub-step 122.

In order to carry out the first sub-step 121, it is necessary to define first
what will be called a distance, which makes it possible to compare the micro-segments, and also a
parameter for assessing the quality of a micro-segment or a partition (= a set of
micro-segments). In both cases, a motion histogram, in which each one of the bins shows
the percentage of frames with a specific type of motion and defined as indicated by the
following relation (5), is used :

$H_s[i] = \frac{N_i}{L_s}$ (5)

where s represents the label of the concerned segment inside the shot, i the motion type
(these motions are called trackleft, trackright, boomdown, boomup, tiltdown, tiltup,
panleft, panright, rollleft, rollright, zoomin, zoomout, fixed), L_s the length of the segment
s, and N_i the number of frames of the segment s with motion type i (it is possible
that $\sum_i H_s[i] > 1$, since different motions can appear concurrently).
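
The motion histogram of relation (5) can be sketched as follows, assuming that a segment is represented as a non-empty list holding, for each frame, the set of motion types detected in that frame. This representation and the function name motion_histogram are assumptions made for the example.

```python
MOTION_TYPES = (
    "trackleft", "trackright", "boomdown", "boomup", "tiltdown", "tiltup",
    "panleft", "panright", "rollleft", "rollright", "zoomin", "zoomout", "fixed",
)


def motion_histogram(frames):
    """Relation (5): fraction of the frames of a segment showing each motion type.

    frames: non-empty list with one set of motion types per frame.
    """
    length = len(frames)
    return {m: sum(m in f for f in frames) / length for m in MOTION_TYPES}
```

For a segment of four frames that all pan left, two of which also zoom in, the panleft bin is 1 and the zoomin bin is 0.5, so the bins sum to 1.5, which illustrates why the sum can exceed 1.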

A segment is assumed to be perfectly homogeneous when it presents a
single combination of camera motion parameters along all its frames, or to be not
homogeneous when it presents important variations on these parameters. The segment
homogeneity is computed on its histogram (relation (5)) : if a segment is perfectly
homogeneous, the histogram bins are equal to either 1 or 0, while if it is not, the bins can
present intermediate values. The measure of the segment homogeneity is then obtained
by measuring how much its histogram differs from the ideal one (i.e. it is computed how
much the bins of the histogram differ from 1 or 0). The distance corresponding to bins
with high values is the difference between the bin value and 1 ; analogously, for bins with
small values, the distance is the bin value itself. An example of histogram is shown in
Fig.3 : two motion types introduce some error because the motion does not appear in all
the frames of the segment (panleft PL and zoomin ZI), and two other ones (boomdown
BD and rollright RR) introduce some error for the opposite reason.

Mathematically, the homogeneity of a segment s is given by the
relation (6) :

$H(s) = \sum_i e(i)$ (6)

where : $e(i) = 1 - H_s[i]$ if $H_s[i] \geq 0.5$
$e(i) = H_s[i]$ if $H_s[i] < 0.5$
$H_s[i]$ = histogram of the segment s
i = motion type.
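
Relation (6) translates into a very short computation, sketched below for a histogram given as a plain sequence of bin values. Assigning a bin equal to exactly 0.5 to the first branch is an assumption; both branches give the same contribution of 0.5 in that case.

```python
def segment_homogeneity(histogram):
    """Relation (6): sum over the bins of their distance to the nearest of 0 and 1."""
    return sum(1.0 - h if h >= 0.5 else h for h in histogram)
```

A perfectly homogeneous segment, whose bins are all 0 or 1, yields 0, while bins of 0.75 and 0.25 each contribute 0.25.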

The homogeneity of a shot S is then equal to the homogeneity of its segments, weighted
by the length of each of them, as

$H(S) = \frac{1}{L(S)} \sum_{j=1}^{N} L_j H(s_j)$ (7)

where $L(S) = \sum_{j=1}^{N} L_j$ is the total length of the shot S and N is the number of segments
said shot contains (it may be noted that small values of H(S) correspond to high levels of
homogeneity). The distance between two segments s1 and s2 is then the homogeneity of
the union of the segments :

$d(s_1, s_2) = H(s_1 \cup s_2)$ (8)
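
As a worked illustration of relations (7) and (8), the following example uses hypothetical segment lengths and motion contents chosen only to make the arithmetic easy to follow.

```latex
% Relation (7): two segments, L_1 = 4 with H(s_1) = 0 and L_2 = 6 with H(s_2) = 0.5
H(S) = \frac{1}{4 + 6}\,(4 \cdot 0 + 6 \cdot 0.5) = 0.3

% Relation (8): s_1 = 3 frames of panleft only, s_2 = 1 frame of zoomout only.
% Histogram of the union (4 frames): H[panleft] = 3/4, H[zoomout] = 1/4, other bins 0
d(s_1, s_2) = H(s_1 \cup s_2) = (1 - 0.75) + 0.25 = 0.5
```

Although each of the two segments of the second case is perfectly homogeneous on its own, their union is not, and the distance measures precisely that loss of homogeneity.
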
The temporal segmentation can now be resumed. The initial
oversegmentation sub-step 121 oversegments the concerned shot in order to
obtain a set of perfectly homogeneous micro-segments, which corresponds to the
following relation (9) :

$H(s) = 0$, whatever s included in S (9)

An example of how to obtain this initial oversegmented partition is shown in Fig.4, with
motion types panleft (PL), zoomout (ZO) and fixed (FIX), s1 to s7 designating the
segments (camera motion parameters may be unknown for some frames : in this
example, the last frames of the shot - the segment s7 - do not have any parameter
associated).
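
One simple way to obtain a partition satisfying relation (9) is to start a new micro-segment every time the combination of motion types changes from one frame to the next. The sketch below assumes one frozenset of motion types per frame, with None standing for frames whose camera motion parameters are unknown; grouping such frames into their own segment follows the example of Fig.4, while the data layout and the function name are assumptions.

```python
def oversegment(frames):
    """Split a shot into runs of consecutive frames sharing one motion combination.

    frames: list with one frozenset of motion types per frame, or None when
            the camera motion parameters are unknown for that frame.
    Returns (start, end, combination) triples, end exclusive.
    """
    segments = []
    start = 0
    for t in range(1, len(frames) + 1):
        # Close the current run at the end of the shot or when the combination changes.
        if t == len(frames) or frames[t] != frames[start]:
            segments.append((start, t, frames[start]))
            start = t
    return segments
```

Every run produced this way has histogram bins equal to 0 or 1, hence a homogeneity of 0.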

The merging sub-step 122 comprises the following operations : a
computation operation 1221, in which the distance between all neighbouring segments
(temporally connected) is computed using the equation (8) for selecting the closest pair
of segments (for possible merging during the next operation), and a fusion decision
operation 1222, in which, in order to decide if the selected pair of segments will be
merged, the homogeneity of the shot (according to the equation (7)) is computed,
assuming that the minimum distance segments have already been merged. The following
fusion criterion is applied :

merge if H(S) < threshold T(H)
do not merge if H(S) > threshold T(H)

(this fusion criterion is global : the decision depends on the homogeneity of the resulting
partition, and not exclusively on the homogeneity of the resulting segment). If the
merging is done, a new iteration starts at the level of the second sub-step (a second
computation operation 1221 is carried out, and so on...). The merging process ends when
there is no pair of neighbouring segments that can still be merged.
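
The loop formed by the operations 1221 and 1222 can be sketched as follows, with segments represented as lists of per-frame motion sets. Two points are assumptions of this sketch rather than statements of the description : the case H(S) = T(H), which the criterion above leaves open, is treated as "do not merge", and the process stops as soon as the closest pair is rejected. All function names are illustrative.

```python
def homogeneity(frames, motion_types):
    """H(s) of relation (6) for a segment given as a list of per-frame motion sets."""
    total = 0.0
    for m in motion_types:
        h = sum(m in f for f in frames) / len(frames)    # relation (5)
        total += 1.0 - h if h >= 0.5 else h
    return total


def shot_homogeneity(segments, motion_types):
    """H(S) of relation (7): length-weighted mean of the segment homogeneities."""
    length = sum(len(s) for s in segments)
    return sum(len(s) * homogeneity(s, motion_types) for s in segments) / length


def merge_segments(segments, motion_types, threshold):
    """Iteratively merge neighbouring segments while the shot stays homogeneous enough."""
    segments = [list(s) for s in segments]
    while len(segments) > 1:
        # Operation 1221: closest pair of temporally adjacent segments, relation (8).
        best = min(
            range(len(segments) - 1),
            key=lambda j: homogeneity(segments[j] + segments[j + 1], motion_types),
        )
        merged = segments[best] + segments[best + 1]
        candidate = segments[:best] + [merged] + segments[best + 2:]
        # Operation 1222: global criterion, evaluated as if the pair were already merged.
        if shot_homogeneity(candidate, motion_types) < threshold:
            segments = candidate
        else:
            break  # assumption: stop when the closest pair cannot be merged
    return segments
```

Because the criterion is evaluated on the whole partition, a short inhomogeneous union can be accepted inside a long shot even though the same union would be rejected in a short one.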

The third step 13 - a shot clustering step - is divided into two sub-steps : a
shot merging sub-step 131, in which pairs of shots are grouped together for creating a
binary tree, and a tree structuring sub-step 132, for restructuring said binary tree in order
to reflect the similarity present in the video sequence.

The shot merging sub-step 131 is provided for yielding a binary tree which
represents the merging order of the initial shots : the leaves represent these initial shots,
the top node the whole sequence, and the intermediate nodes the sequences that are
created by the merging of several shots. The merging criterion is defined by a distance
between shots, and the closest shots are first merged. In order to compute the distance
between shots, it is necessary to define a shot model providing the features to be
compared and to set the neighbourhood links between them (which indicate what
merging can be done). The process ends when all the initial shots have been merged into
a single node or when the minimum distance between all couples of linked nodes is
greater than a specified threshold.

The shot model must obviously make it possible to compare the content of several
shots, in order to decide which shots must be merged and in which order. In
still images, luminance and chrominance are the main features of the image, while in a
video sequence motion is an important source of information due to the temporal
evolution. So, average images, histograms of luminance and chrominance information
(YUV components) and motion information will be used to model the shots.

For implementing the shot merging sub-step 131, it is necessary to carry out
the following operations : (a) to get a minimum distance link (operation 1311) ; (b) to
check a distance criterion (operation 1312) ; (c) to merge nodes (operation 1313) ; (d) to
update links and distances (operation 1314) ; (e) to check the top node (operation 1315).

In the operation 1311, both the minimum and the maximum distance are
computed for every pair of linked nodes. The maximum distance is first checked : if it is
higher than a maximum distance threshold d(max), the link is discarded, otherwise the
link is taken into account. Once all links have been scanned, the minimum distance is
obtained.

In the operation 1312, in order to decide if the nodes pointed by the
minimum distance link must be merged, the minimum distance is compared to a
minimum distance threshold d(min) : if it is higher than said threshold, no merging is
performed and the process ends, otherwise pointed nodes are merged and the process
goes on.

In the operation 1313, nodes pointed by the minimum distance links are
merged. In the operation 1314, said links are updated to take into account the merging
that has been done and, once links have been updated, the distance of those links which
point to the new node is recomputed. In the final operation 1315, the number of
remaining nodes is checked : if all initial shots have been merged into a single node, the
process ends, otherwise a new iteration begins.
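
The operations 1311 to 1315 can be sketched as the loop below. Several simplifications are assumptions of this sketch : the shot model is hidden behind a caller-supplied function shot_distance(p, q), the neighbourhood links only connect temporally adjacent nodes, the minimum and maximum distance of a link are taken over all pairs of initial shots of the two nodes, and the link distances are simply recomputed at each iteration instead of being updated incrementally.

```python
from itertools import product


class Node:
    def __init__(self, shots, left=None, right=None, d_min=None, d_max=None):
        self.shots = shots                      # indices of the initial shots under this node
        self.left, self.right = left, right     # children, None for a leaf
        self.d_min, self.d_max = d_min, d_max   # distances between the two children


def link_distances(a, b, shot_distance):
    """Minimum and maximum distance over all shot pairs of two nodes (assumption)."""
    values = [shot_distance(p, q) for p, q in product(a.shots, b.shots)]
    return min(values), max(values)


def merge_shots(n_shots, shot_distance, d_min_threshold, d_max_threshold):
    nodes = [Node([k]) for k in range(n_shots)]   # temporally ordered current roots
    while len(nodes) > 1:                         # operation 1315
        # Operation 1311: scan the links, discarding those whose maximum is too high.
        best = None
        for j in range(len(nodes) - 1):
            lo, hi = link_distances(nodes[j], nodes[j + 1], shot_distance)
            if hi > d_max_threshold:
                continue
            if best is None or lo < best[0]:
                best = (lo, hi, j)
        # Operation 1312: stop when the minimum distance exceeds its threshold.
        if best is None or best[0] > d_min_threshold:
            break
        lo, hi, j = best
        # Operations 1313 and 1314: merge; links are re-evaluated at the next scan.
        merged = Node(nodes[j].shots + nodes[j + 1].shots, nodes[j], nodes[j + 1], lo, hi)
        nodes[j:j + 2] = [merged]
    return nodes   # a single root (one tree) or several roots (a forest)
```

With four shots whose features fall into two well-separated groups and a low minimum distance threshold, this loop returns two roots, i.e. a forest of two small binary trees.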

The shot merging sub-step 131 may yield a single tree if all the initial shots
are similar enough or a forest if initial shots are quite different. An example of binary tree
for the creation of a table of contents is shown in Fig.5. Inside the leaf nodes of this tree,
the label and, between brackets, the starting and ending frame numbers of the shot have
been indicated ; inside the remaining nodes, the label, the fusion order (between
parentheses) and the minimum and maximum distance between its two siblings.

The tree restructuring sub-step 132 is provided for restructuring the binary
tree obtained in the sub-step 131 into an arbitrary tree that should reflect more clearly
the video structure. To this end, it is decided to remove the nodes that have been
created by the merging process but that do not convey any relevant information, said
removal being done according to a criterion based on the variation of the similarity
degree (distance) between the shots included in the node :

- if the analyzed node is the root node (or one of the root nodes if various
binary trees have been obtained after the merging process), then the node should be
preserved and appear in the final tree ;

- if the analyzed node is a leaf node (i.e. corresponds to an initial shot), then
it has also to remain in the final tree ;

- otherwise, the node will be kept in the final tree only if the following
conditions (10) and (11) are satisfied :

|d(min)[analyzed node] - d(min)[parent node]| < T(H) (10)

|d(max)[analyzed node] - d(max)[parent node]| < T(H) (11)
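
The three rules above can be sketched as a recursive traversal that applies the conditions (10) and (11) exactly as stated. The sketch assumes that the parent referred to in the conditions is the parent in the original binary tree and that the children of a removed node are attached to the nearest kept ancestor; the class and function names are illustrative.

```python
class TreeNode:
    def __init__(self, label, d_min=None, d_max=None, children=()):
        self.label = label
        self.d_min, self.d_max = d_min, d_max   # None for leaf nodes (initial shots)
        self.children = list(children)


def restructure(node, threshold, parent=None):
    """Return the list of nodes that replace `node` in the restructured tree."""
    kept_children = []
    for child in node.children:
        kept_children.extend(restructure(child, threshold, parent=node))
    is_root = parent is None
    is_leaf = not node.children
    keep = is_root or is_leaf or (
        abs(node.d_min - parent.d_min) < threshold          # condition (10)
        and abs(node.d_max - parent.d_max) < threshold      # condition (11)
    )
    if keep:
        return [TreeNode(node.label, node.d_min, node.d_max, kept_children)]
    return kept_children   # removed node: its children move up (assumption)
```

Calling restructure on each root of the forest yields one arbitrary tree per root, in which a kept node may have more than two children.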



As shown in Fig.6, the tree resulting from the restructuring sub-step 132 represents more
clearly the structure of the video sequence : nodes in the second level of the hierarchy
(28, 12, 13, 21) represent the four scenes of the sequence, while nodes in the third (or
occasionally in the fourth) level represent the initial shots.

The invention is not limited to the implementation described above, from
which modifications or broader applications may be deduced without departing from the
scope of the invention. For instance the invention also relates to a method for indexing
data that have been processed according to the previously described method. Such a
method, illustrated in Fig.7, comprises a structuring step 71, carrying out a sub-division of
each processed sequence into consecutive shots and the splitting of each of the obtained
shots into sub-entities (or micro-segments), and a clustering step 72, creating the final
hierarchical structure. These steps 71 and 72, respectively similar to the steps 11-12 and
to the step 13 previously described, are followed by an additional indexing step 73,
provided for adding a label to each element of the hierarchical structure defined for each
processed video sequence.

The invention also relates to an image retrieval system such as illustrated in
Fig.8, comprising a camera 81, for the acquisition of the video sequences (available in the
form of sequential video bitstreams), a video indexing device 82, for carrying out said
data indexing method (said device captures the different levels of content information in
said sequences by analysis, hierarchical segmentation, and indexing on the basis of the
categorization resulting from said segmentation), a database 83 that stores the data
resulting from said categorization (these data, sometimes called metadata, will allow the
retrieval or browsing step then carried out on request by users), a graphical user
interface 84, for carrying out the requested retrieval from the database, and a video
monitor 85 for displaying the retrieved information (the invention also relates, obviously,
to the video indexing device 82, which makes it possible to implement the method according to the
invention).
CLAIMS :

1. A method for defining in a hierarchical fashion the structure of a video 
sequence that corresponds to successive frames, comprising the following steps : 

(1) a shot detection step, provided for detecting the boundaries between 
consecutive shots, a shot being a set of contiguous frames without editing effects ; 

(2) a sub-division step, provided for splitting each shot into sub-entities,
called micro-segments ; 

(3) a clustering step, provided for creating the final hierarchical structure of 
the processed video sequence. 

2. A method according to claim 1, wherein said shot detection step uses a 
similarity criterion based on a computation of the mean displaced frame difference curve 
and the detection of the highest peaks of said curve. 

3. A method according to any one of claims 1 and 2, wherein said sub-division
step uses a criterion involving the level of homogeneity on the motion parameters of the 
camera used to generate the processed video sequence. 

4. A method according to claim 3, wherein the homogeneity of a micro-segment
is computed on a motion histogram, each bin of which shows the percentage of
frames with a specific type of motion.

5. A method according to claim 4, wherein, if the bins of the histogram are not 
equal to either 1 or 0, i.e. present intermediate values indicating that a micro-segment is 
not perfectly homogeneous, in order to segment a shot, a distance between two 
segments is computed, based on the homogeneity of the segments union, said 
homogeneity being itself deduced from the histogram of a micro-segment and the
different motion types, the homogeneity of a shot being equal to the homogeneity of its 
micro-segments weighted by the length of each of them, a fusion between any pair of 
segments being decided or not according to the value of the homogeneity of the shot 
with respect to a predefined threshold T(H) and assuming that the selected segments 
have already been merged, and such a possible merging process between micro- 
segments ending when there is no pair of neighbouring micro-segments that can be 
merged. 

6. A method for indexing data available in the form of a video sequence that
corresponds to successive frames, comprising the following segmentation steps :

(1) a structuring step, provided for sub-dividing said sequence into 
consecutive shots and splitting each of said shots into sub-entities called 
micro-segments ; 

(2) a clustering step, provided for creating on the basis of said segmentation 
a final hierarchical structure of the processed video sequence ; 

(3) an indexing step, provided for adding a label to each element of said 
hierarchical structure. 



7. A video indexing device including means for carrying out a method according
to claim 6. 

8. An image retrieval system including : 

(1) means for carrying out a method according to claim 6, for defining in a
hierarchical fashion the structure of a video sequence that corresponds to successive
frames and giving an indexing label to each element of the hierarchical structure thus
defined and storing said labels ;

(2) means for performing on the basis of the stored labels any image 
retrieval using one or several features of said image to be retrieved. 



Abstract 

The invention relates to a method intended to automatically create a
description of a video sequence - i.e. its table of contents - by means of an analysis of
said sequence. The main step of said method is a temporal segmentation of the video
shots of the sequence, using camera motion parameters. This segmentation uses a
similarity criterion involving, for sub-entities of each shot, the level of homogeneity of
these sub-entities on the motion parameters of the camera used to acquire the original
images and generate the bitstream constituting the processed sequence.

Fig.2